Distance, dissimilarity index, and network community structure 
We address the question of finding the community structure of a complex network. In an earlier 
effort [H. Zhou, Phys. Rev. E (2003)], the concept of network random walking was introduced and 
a distance measure was defined. Here we calculate, based on this distance measure, the dissimilarity 
index between nearest-neighboring vertices of a network and design an algorithm to partition these 
vertices into communities that are hierarchically organized. Each community is characterized by 
an upper and a lower dissimilarity threshold. The algorithm is applied to several artificial and 
real-world networks, and excellent results are obtained. In the case of artificially generated random 
modular networks, this method outperforms the algorithm based on the concept of edge betweenness 
centrality. For yeast's protein-protein interaction network, we are able to identify many clusters that 
have well defined biological functions. 

PACS numbers: 87.10.+e, 89.75.-k, 89.20.-a 
I. INTRODUCTION 



A graph (network) of vertices (nodes) and edges is a 
useful tool for describing the interactions between 
different agents of a complex system. For example, to 
analyze protein-protein physical interactions in the yeast 
Saccharomyces cerevisiae [1], we can denote each protein 
as a distinct vertex of a graph and set up an edge between 
two vertices if the corresponding proteins have a direct 
physical interaction. Many such kinds of 
networks are constructed in sociological, biological, and 
technological fields, and they usually have very compli- 
cated connection patterns. What one needs is a method 
that is capable of classifying vertices of a complex net- 
work into different clusters (communities). If a network 
is appropriately decomposed into a series of functional 
units, (a) the structure of the network can be better un- 
derstood and the relationship between its different com- 
ponents will be clear, (b) the principal function of each 
cluster can be inferred from the functions of its members, 
and (c) possible functions for members of a cluster can 
be suggested by comparing the functions of other mem- 
bers. Network clustering techniques are therefore very 
important in the emerging fields of bioinformatics and 
proteomics. 

A good clustering method needs to satisfy two conditions: 
first, the inherent structure of the network should be 
preserved; second, it should provide a quantified resolution 
parameter to mark the significance of the clusters obtained 
at each level of the partitioning process. The global 
organization of a network should already be identified at 
low resolutions, with finer and finer structures emerging 
as the resolving power is increased. 

Many existing methods [2, 3] take into account only local 
information about each vertex, such as the number of 
nearest neighbors shared with other vertices or the number 
of vertex-independent paths to other vertices. Recently, 
Girvan and Newman [4] suggested an elegant global algorithm 
that extends the vertex betweenness centrality of Freeman 
[5] to edges. Their algorithm works iteratively by removing 
the current edge(s) of highest betweenness centrality. When 
applied to an ensemble of random modular networks, this 
algorithm greatly outperforms some conventional methods [4]. 
On the other hand, it does not provide a parameter to 
quantify the differences between communities. 



In reference [6] a Brownian particle is "introduced" 
into a network to "measure" the distances between vertices. 
In the present work, we extend the basic idea of [6] by 
defining, based on this distance matrix, a quantity called 
the dissimilarity index between nearest-neighboring 
vertices. The dissimilarity index quantifies how strongly 
two nearest-neighboring vertices tend to belong to the same 
community: the smaller the index, the more likely they 
belong together. A hierarchical algorithm is then worked 
out; it makes use of the dissimilarity indices and 
decomposes a network into a hierarchical sequence of 
clusters. Each of the communities is characterized by an 
upper and a lower dissimilarity threshold. 



The method, which can work on unweighted as well as 
weighted networks, is applied to several artificial and 
real networks, and satisfactory results are obtained. 
For the case of random modular networks, the present 
algorithm outperforms the method of Girvan and Newman [4]. 
When the algorithm is applied to the protein-protein 
interaction network of yeast, we are able to identify many 
protein clusters that have well-defined biological 
functions. 



In section II we review the distance measure of reference 
[6] and define a dissimilarity index for each pair of 
nearest-neighboring vertices. A dissimilarity-index-based 
hierarchical algorithm is outlined in section III and 
applied to two kinds of artificially generated networks and 
four real-world networks in section IV. We conclude our 
work in section V. 
II. DISTANCE MEASURE AND 
DISSIMILARITY INDEX 

In the opinion of Flake, Lawrence, and Giles [7], a 
community in a (sub)graph should satisfy the requirement 
that each vertex's total intra-community interaction be 
stronger than its total interaction with other vertices in 
the (sub)graph. This turns out to be a very strong 
constraint. In this work, we weaken this condition and 
require only that a vertex should have stronger total 
interaction with other vertices of its own community than 
with the vertices of any other community of the (sub)graph. 

We consider a connected network of $N$ vertices and $M$ 
edges. The network's connection pattern is specified by 
the generalized adjacency matrix $A$. We assume that the 
value of each non-zero element of $A$ (say $A_{ij}$) 
denotes the interaction strength between vertices $i$ and 
$j$. The distance $d_{ij}$ from vertex $i$ to vertex $j$ is 
defined as the average number of steps needed for a 
Brownian particle on this network to move from vertex $i$ 
to vertex $j$ [6]. At each vertex (say $k$) the Brownian 
particle jumps in the next step to a nearest-neighboring 
vertex (say $l$) with probability 
$P_{kl} = A_{kl} / \sum_{m=1}^{N} A_{km}$. The distance 
matrix thus defined is asymmetric (in general 
$d_{ij} \neq d_{ji}$), and it is calculated by solving $N$ 
linear-algebraic equations [6]. 
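
As an illustration of how such a distance matrix could be computed, the following Python sketch solves one linear system per target vertex. It assumes the standard mean first-passage relation $d_{ij} = 1 + \sum_{l \neq j} P_{il} d_{lj}$, which matches the verbal definition above but may differ in form from the equations of reference [6]. The function names `solve_linear` and `distance_matrix` are hypothetical, and the adjacency matrix is assumed to be a list of lists describing a connected network.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0.0:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]


def distance_matrix(adj):
    """Mean number of random-walk steps d[i][j] from vertex i to vertex j.

    The walker jumps from k to l with probability P_kl = A_kl / sum_m A_km.
    Assumes a connected network (otherwise the systems are singular).
    """
    n = len(adj)
    strength = [sum(row) for row in adj]
    p = [[adj[k][l] / strength[k] for l in range(n)] for k in range(n)]
    d = [[0.0] * n for _ in range(n)]
    for j in range(n):
        others = [i for i in range(n) if i != j]
        # d_ij = 1 + sum_{l != j} P_il d_lj, i.e. (I - P restricted) d = 1
        a = [[(1.0 if i == l else 0.0) - p[i][l] for l in others]
             for i in others]
        column = solve_linear(a, [1.0] * len(others))
        for i, value in zip(others, column):
            d[i][j] = value
    return d
```

For a three-vertex chain 0-1-2, this sketch gives a distance of 1 step from vertex 0 to vertex 1 but 3 steps from vertex 1 to vertex 0, which illustrates the asymmetry of the distance matrix.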

Taking any vertex $i$ as the origin of the network, the 
set $\{d_{i1}, \ldots, d_{i,i-1}, d_{i,i+1}, \ldots, d_{iN}\}$ 
measures how far all the other vertices are located from 
the origin. It is therefore a perspective of the whole 
network with vertex $i$ as the viewpoint. Suppose vertices 
$i$ and $j$ are nearest neighbors ($A_{ij} > 0$); the 
difference in their perspectives of the network can then 
be measured quantitatively. We define the dissimilarity 
index, $\Lambda(i,j)$, by the following expression: 

$$\Lambda(i,j) = \frac{\sqrt{\sum_{k \neq i,j}^{N} \left[ d_{ik} - d_{jk} \right]^{2}}}{N-2}. \qquad (1)$$



If two nearest-neighboring vertices $i$ and $j$ belong to 
the same community, then the average distance $d_{ik}$ from 
$i$ to any other vertex $k$ ($k \neq i, j$) will be quite 
similar to the average distance $d_{jk}$ from $j$ to $k$; 
the network's two perspectives (based on $i$ and $j$, 
respectively) will therefore be quite similar. 
Consequently, $\Lambda(i,j)$ will be small if $i$ and $j$ 
belong to the same community and large if they belong to 
different communities. 
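
A minimal Python sketch of this comparison of perspectives is given below. It assumes that the dissimilarity index takes the root-sum-square form, that is, the square root of the summed squared differences $d_{ik} - d_{jk}$ over all $k \neq i, j$, divided by $N - 2$. The function name `dissimilarity_index` and the list-of-lists distance matrix `d` are illustrative choices, not part of the original work.

```python
from math import sqrt


def dissimilarity_index(d, i, j):
    """Dissimilarity between the perspectives of vertices i and j.

    d is an N x N distance matrix (d[i][k] = mean steps from i to k);
    the sum runs over all vertices k other than i and j.
    """
    n = len(d)
    total = sum((d[i][k] - d[j][k]) ** 2
                for k in range(n) if k != i and k != j)
    return sqrt(total) / (n - 2)
```

In practice the function would be evaluated only for pairs with $A_{ij} > 0$, since the index is defined for nearest-neighboring vertices.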



III. THE ALGORITHM 

We exploit the dissimilarity index to decipher the 
community structure of a network. After the distance 
matrix $\{d_{ij}\}$ and the dissimilarity indices 
$\{\Lambda(i,j)\}$ for all pairs of nearest-neighboring 
vertices are obtained, the algorithm works as follows: 



1. Initially the whole network is just one single community. 
This community is assigned an upper dissimilarity threshold 
$\theta_{\rm upp}$ equal to the maximum value of all the 
dissimilarity indices. 

2. For each community, a resolution threshold parameter 
$\theta$ is introduced and is assigned the initial value 
$\theta_{\rm upp}$ of that community. The algorithm is 
unable to discriminate between two nearest-neighboring 
vertices $i$ and $j$ when $\Lambda(i,j) < \theta$; if this 
happens, vertices $i$ and $j$ are marked as "friends". 

3. The value of $\theta$ is decreased by a small amount. All 
edges in the community are examined to see whether the two 
nearest-neighboring vertices they join are friends. 
Different friends sets of the community are then formed, 
each of which contains all the friends of the vertices in 
the set. There may also be vertices in the community that 
do not have any friends. Each of these vertices is moved to 
the friends set that has the strongest interaction with it. 
After this operation, the vertices of the community are 
distributed into a number of disjoint sets (this number may 
be unity). 

4. Each vertex in a subcluster should have stronger 
interaction with vertices within this subcluster than with 
vertices of any other subcluster of this community. To 
fulfill this requirement, we perform a local adjustment 
process: each vertex that fails to meet this requirement is 
moved to the friends set that has the strongest total 
interaction with it. This adjustment is performed 
simultaneously for all these unstable vertices and is 
repeated until no unstable vertex remains. 

5. If the vertices of the community remain together, the 
algorithm returns to step 3. If they are divided into two 
or more sets, then the community being processed is 
assigned a lower dissimilarity threshold $\theta_{\rm low}$ 
equal to the current value of $\theta$, and it is no longer 
considered. Each of the identified subsets of this 
community is regarded as a new (lower-level) community, 
with upper dissimilarity threshold $\theta_{\rm upp}$ equal 
to the current value of $\theta$. The algorithm returns to 
step 2 to work on another identified community. 

6. After all the (sub)communities are processed, a 
dendrogram is drawn to show the relationship between the 
different communities as well as the upper and lower 
dissimilarity thresholds of each community. The vertex set 
of each community is also reported. 
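
The formation of friends sets in steps 2 and 3 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used for the results below: the function name `friend_sets`, the dictionary layout (edge tuples mapped to interaction strengths and to dissimilarity indices), and the use of a union-find structure are all hypothetical choices. The text does not say what happens to a friendless vertex that touches no friends set; here such a vertex is kept as a singleton, which is an assumption.

```python
def friend_sets(nodes, weights, lam, theta):
    """Group the vertices of one community into friends sets.

    nodes   : sequence of vertex labels of the community
    weights : {(i, j): interaction strength A_ij} for each edge
    lam     : {(i, j): dissimilarity index} for the same edges
    theta   : current resolution threshold
    """
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    befriended = set()
    for (i, j), value in lam.items():
        if value < theta:  # i and j cannot be told apart: friends
            parent[find(i)] = find(j)
            befriended.update((i, j))

    label = {v: find(v) for v in befriended}
    result = {}
    for v, root in label.items():
        result.setdefault(root, set()).add(v)

    # Total interaction of each friendless vertex with each friends set.
    pull = {}
    for (i, j), w in weights.items():
        for v, u in ((i, j), (j, i)):
            if v not in befriended and u in befriended:
                by_set = pull.setdefault(v, {})
                by_set[label[u]] = by_set.get(label[u], 0.0) + w

    singletons = []
    for v in nodes:
        if v in befriended:
            continue
        if v in pull:
            best = max(pull[v], key=pull[v].get)
            result[best].add(v)
        else:
            singletons.append({v})  # assumption: left on its own
    return sorted(sorted(s) for s in list(result.values()) + singletons)
```

Calling the function repeatedly with a slowly decreasing `theta` and checking when more than one set is returned would correspond to the loop between steps 3 and 5.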

The above procedure can easily be implemented in the 
C++ programming language. The source code, as well as 
the data for the examples studied in the following section, 
will be made publicly available [8]. 
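
Independently of the C++ code referred to above, the local adjustment of step 4 can be illustrated with a short Python sketch. The function name `local_adjust`, the dictionary-based inputs, and the `max_rounds` safety cap are assumptions made for this example. A vertex is moved only when another set has strictly stronger total interaction with it than its own set; how ties are treated is not specified in the text, so this is also an assumption.

```python
def local_adjust(membership, weights, max_rounds=1000):
    """Repeatedly move unstable vertices to their strongest set.

    membership : {vertex: set label} for one community
    weights    : {(i, j): interaction strength A_ij} for each edge
    max_rounds : safety cap on the number of sweeps (assumption)
    """
    membership = dict(membership)  # do not modify the caller's dict
    neighbours = {}
    for (i, j), w in weights.items():
        neighbours.setdefault(i, []).append((j, w))
        neighbours.setdefault(j, []).append((i, w))

    for _ in range(max_rounds):
        moves = {}
        for v, own in membership.items():
            strength = {}
            for u, w in neighbours.get(v, []):
                key = membership[u]
                strength[key] = strength.get(key, 0.0) + w
            if not strength:
                continue  # isolated vertex: nothing to compare
            best = max(strength, key=strength.get)
            if strength[best] > strength.get(own, 0.0):
                moves[v] = best  # v is unstable
        if not moves:
            break
        membership.update(moves)  # all unstable vertices move together
    return membership
```

Because all unstable vertices are moved simultaneously, oscillations are conceivable in contrived cases, which is why the sketch bounds the number of sweeps.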
IV. APPLICATIONS 

We test the performance of the above-mentioned algorithm 
by applying it first to two kinds of artificial networks 
and then to four real-world networks. 



A. Artificial random modular networks 

To compare quantitatively with the work of Girvan and 
Newman [4], the algorithm is first applied to a random 
modular network. The network has 128 vertices, which are 
divided into 4 modules of 32 vertices each. Each vertex has 
on average 16 edges connecting it to other vertices, and on 
average $z_{\rm out}$ of each vertex's edges lead to 
vertices of other modules. All the edges are set up 
randomly with these two fixed expectation values. The 
present method is able to recover the modular structure of 
the network up to $z_{\rm out} = 7$. It slightly 
outperforms the method of Girvan and Newman [4]. For 
example, for an ensemble of random graphs with 
$z_{\rm out} = 6.0$, on average only 4.5 vertices are 
misclassified by the present method (a vertex is 
misclassified when it is assigned a cluster identity 
different from that of the majority of vertices of its 
module), while on average about 13 vertices are 
misclassified by the method of Girvan and Newman [4]. 
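
One way such a test network could be generated is sketched below in Python. The sketch assumes that every pair of vertices is linked independently, with a probability chosen so that the expected total degree is 16 and the expected number of inter-module edges per vertex is $z_{\rm out}$; the text only states that the edges are set up randomly with these two expectation values, so the independent-edge rule is an assumption. The function name `random_modular_network` and its parameters are hypothetical.

```python
import random


def random_modular_network(z_out, n_modules=4, module_size=32,
                           z_total=16.0, seed=None):
    """Return (module, edges) for a random modular network.

    module[v] is the module index of vertex v; edges is a list of
    pairs (i, j) with i < j. Each pair is linked independently.
    """
    rng = random.Random(seed)
    n = n_modules * module_size
    module = [v // module_size for v in range(n)]
    # Probabilities giving the required expected degrees per vertex.
    p_in = (z_total - z_out) / (module_size - 1)
    p_out = z_out / (n - module_size)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if module[i] == module[j] else p_out
            if rng.random() < p:
                edges.append((i, j))
    return module, edges
```

With the default parameters the expected number of edges is 128 × 16 / 2 = 1024, and individual realizations fluctuate around this value.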

In figure 1 the community structure of a randomly 
generated modular network with $z_{\rm out} = 6.0$ is 
shown. When the resolution threshold is above 0.323, the 
network as a whole can be regarded as one giant community. 
At resolution threshold 0.323, however, 3 subgroups 
suddenly emerge, with sizes 32, 32, and 64, respectively. 
The first two communities correspond to two modules of the 
network, and the last one is the union of the other two 
modules. At resolution threshold 0.319, this latter 
community is in turn divided into two subcommunities of 32 
vertices each, corresponding to the remaining two modules. 
At resolution threshold 0.258, one of the modules of the 
network is found to split into two subgroups of sizes 14 
and 18, respectively. In this example, the four designed 
modules of the network correspond to the resolution range 
from 0.258 to 0.319. 

How should one interpret the resolution parameters in 
dendrograms such as that shown in figure 1? Take module 
2 and module 3 as examples. Figure 1 suggests that edges 
between these two modules have dissimilarity indices 
larger than 0.323, while edges within these modules have 
dissimilarity indices of about 0.227. There is therefore a 
large dissimilarity gap of about 0.1 between an 
inter-modular edge and an intra-modular edge. 

It is noticeable that, with the present algorithm, each 
community has a certain range of stability. Subcommunities 
emerge only when the resolution threshold is lowered below 
a certain level, and they emerge abruptly. 







FIG. 1: The community structure of a random modular 
network of 128 vertices and 1067 unweighted edges (see the 
main text for the rules by which such a network is 
generated). Here and in the following figures, in the 
pattern xx-yy, the number yy after the hyphen denotes the 
group identity of vertex xx according to information from 
other sources. 
B. Regular hierarchy networks 

We analyze here the community structure of the model 
hierarchy network studied by Ravasz and coauthors [3]. 
The network is constructed in several steps [3]: at level 
$n = 0$, a fully connected unit of four vertices is 
generated. At level $n = 1$, three replicas of this unit 
are added and the external vertices of these replicas are 
connected to the central vertex of the $n = 0$ unit, while 
the central vertices of the replicas are connected to each 
other. This replication-connection process can be continued 
to any desired level $n$. In figure 2(A) such a network at 
level $n = 2$ is shown. It was remarked [3] that 
conventional network clustering methods are unable to 
uncover the hierarchical structure of such a network. The 
present method, however, works very well: figure 2(B) 
shows the obtained community structure of the network of 
figure 2(A). The hierarchical organization of the vertices 
in the network is largely preserved in figure 2(B). At 
resolution threshold 1.95 the network is divided into 4 
subgroups of size 3 and a giant component of size 62. 
Later, at resolution threshold 1.89, this giant component 
splits into 2 parts: one part has size 12 and is further 
divided into 3 subgroups of size 4 at resolution threshold 
1.52; the other part has size 50 and, at resolution 
threshold 1.53, further decomposes into 3 subgroups of 
size 13, 13, and 14, respectively. At resolution threshold 
0.91, each of these three subgroups is further divided 
into 3 subgroups. 



C. The karate club network 

The karate club data examined in references [4] and [6] 
is re-evaluated here. This network is weighted: each 
edge is assigned a different strength. The present 
algorithm leads to the community structure of figure 3. At 
resolution threshold 1.67 the network decomposes into 
one small component of 5 vertices and a large component 
of 29 vertices. At resolution threshold 0.87, this large 
component further decomposes into two subgroups, one of 
which has 18 members and the other 11 members. A 
comparison with the actual fission pattern is also shown 
in figure 3. 



D. The football team network 

The football team network compiled by Girvan and 
Newman [4] and studied in references [4] and [6] is 
re-investigated here. The present method results in the 
community structure of figure 4. Each vertex's actual 
group identity is also shown for comparison. In the 
resolution region between 0.41 and 0.64 there are 12 
communities according to the present algorithm. Of the 12 
actual groups, only the members of group-12 are 
distributed among other groups (with good reason, because 
there are actually very few direct interactions between 
the five members of this group). Vertex 111 is classified 
together with members of group-11; we have checked that 
this vertex has 8 edges linking to group-11 and only 3 
edges to other groups. Vertex 59 is classified together 
with members of group-9; we have also checked that it 
has stronger interaction with group-9 than with any 
other group. 

The organization of the different teams suggested by 
the present algorithm seems to be even better than the 
original organization. 



E. The scientific collaboration network 

The scientific collaboration network compiled by Girvan 
and Newman [4] and examined in references [4] and [6] 
is also re-examined. This network is also weighted. 
The present method suggests the community structure 
shown in figure 5. In accordance with the actual 
situation, on the global scale the network clearly has 3 
giant communities of comparable sizes. Each of these giant 
communities can be further decomposed into several 
subcommunities when the resolving power is increased. 



F. The protein interaction network of yeast 

The protein interaction network of yeast is constructed 
based on the data reported in references ^^'^ E3' 
It contains 1471 proteins and 2770 edges (protein-protein 
physical interactions). This network has already been 
studied previously; here we construct a reduced 
interaction network based on the original one. First, 
self-connections are removed; second, proteins that are 
connected to the network by only one edge are removed. The 
second step is repeated until no protein of degree one 
remains. The reason for removing all the proteins of degree 
one is that, according to the idea of Girvan and Newman 
[4], a vertex that is connected to the network by just one 
edge should be in the same community as its 
nearest-neighboring vertex, so its status need not be 
considered separately. Of course, we have checked that the 
results are in fact identical when the network-reduction 
process is not performed. The reduced network contains 871 
proteins and 2043 unweighted interactions (edges). 
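
The network-reduction process described above can be sketched in a few lines of Python. The function name `prune_degree_one` and the edge-list input format are illustrative assumptions; the sketch drops self-connections and then strips vertices of degree one until none remain.

```python
def prune_degree_one(edges):
    """Remove self-loops, then repeatedly strip degree-one vertices.

    edges : iterable of (i, j) pairs with comparable vertex labels
    Returns the sorted list of surviving edges (i, j) with i < j.
    """
    adjacency = {}
    for i, j in edges:
        if i == j:
            continue  # self-connection
        adjacency.setdefault(i, set()).add(j)
        adjacency.setdefault(j, set()).add(i)

    leaves = [v for v, nb in adjacency.items() if len(nb) == 1]
    while leaves:
        v = leaves.pop()
        if v not in adjacency or len(adjacency[v]) != 1:
            continue  # already removed or no longer a leaf
        (u,) = adjacency.pop(v)
        adjacency[u].discard(v)
        if len(adjacency[u]) == 1:
            leaves.append(u)  # removing v turned u into a leaf
        elif not adjacency[u]:
            del adjacency[u]  # u was only connected to v
    return sorted((i, j) for i in adjacency
                  for j in adjacency[i] if i < j)
```

Applied to a triangle with a two-edge tail attached, the sketch removes the tail vertex by vertex and leaves only the triangle.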

The community structure of this network is shown in 
figure 6. It appears strikingly different from those of 
the other networks studied in this paper. In the 
resolution range from about 1.5 to 18.0 many small 
communities appear, but the network is dominated by just 
one large cluster whose size is proportional to the total 
size of the network. This is in accordance with the 
earlier study of this network, in which the original 
network was decomposed into one large component and 
several small components. 
When the resolution threshold is decreased below 1.5, 
the largest cluster is divided into several subclusters of 




FIG. 2: A hierarchy network [3] at level n = 2 (A) and its community structure (B). 
FIG. 3: The community structure of the karate club network 
of Zachary 0. 



comparable sizes. The biological significance of such a 
community structure is yet to be investigated. 

Based on the community structure shown in figure 6, 
we can construct clusters of proteins that might be of 
biological significance. Here we show just three examples 
of such protein clusters, corresponding respectively to 
high, intermediate, and low resolution thresholds. 

The first example is a cluster that appears at resolution 
threshold 18.04. It contains 16 proteins and 33 edges, and 
has the structure shown in figure 7(A). This cluster is 
stable, that is, each vertex in it is more strongly 
connected to vertices in this cluster than to vertices 
outside, and it has no further subcommunity structure. 
According to the protein interaction databank, 15 of these 
proteins are involved in the ATP synthesis process in 
yeast. They may form a very important part of yeast's 
mitochondrial ATPase complex. One protein of this cluster, 
YIL124W, is a hypothetical membrane protein. Because this 
last protein has only one interaction with other members 
of the cluster, it may not share the biological function 
of the other members. 

The second example is a cluster that appears at 
resolution threshold 5.11. It contains 11 proteins and 
38 edges, and has the structure shown in figure 7(B). 
This cluster is also stable and has no further structure. 
According to the protein interaction databank [13, 14], 
among these 11 proteins, YBL084C, YFR036W, 
FIG. 5: The community structure of the scientific 
collaboration network of Girvan and Newman [4]. 



YHR166C, YKL022C are known to be cell division control 
proteins; YGL240W plays a role in the cell cycle and 
mitosis; YDR118W, YNL172W, YDR249C are probably membrane 
proteins; and YLR127C, YDL008W, YLR102C are hypothetical 
proteins whose functions remain to be determined. It is 
quite likely that all the proteins in this cluster are 
closely involved in the cell division and membrane fission 
processes. We anticipate that the three hypothetical 
proteins of this cluster will also have similar biological 
functions. 

The third example is a cluster which appears only 
when the resolution threshold is refined to below 0.88. 
It contains 14 proteins and 41 protein-protein 
interactions. This cluster is also stable and has no 
further structure. The interaction pattern of this cluster 
is shown in figure 7(C). Among these 14 proteins, 
according to the protein interaction databank [13, 14], 
YCR093W, YPR072W, YDL165W, YER068W, YIL038C 
are general negative regulator of transcription subunits; 
YAL021C is a glucose-repressible alcohol dehydrogenase 
transcriptional effector; YNR052C is a ubiquitous 
transcription factor; YDR443C, YGR104C are suppressors of 
RNA polymerases; YNL025C is the RNA polymerase II 
holoenzyme cyclin-like subunit; YPL042C is the meiotic 
mRNA stability protein kinase UME5; YGR092W is the cell 
cycle protein kinase DBF2; and YKR036C and YFL028C are 
two hypothetical proteins. It is quite likely that this 
cluster is mainly involved in the RNA transcription 
process, and we anticipate that the two hypothetical 
proteins of this cluster are also strongly related to this 
biological function. 

To conclude this subsection, we stress that, based on 
the community structure of figure 6, many clusters of 
proteins can be constructed. Here we have mentioned just 
three examples. These identified protein clusters could 
help researchers to assign possible biological functions 
to hypothetical proteins, and could also suggest proteins 
that may be involved in carrying out a particular 
biological reaction. 



V. CONCLUSION AND DISCUSSION 

In our earlier work [6], the distance between two vertices 
of a graph was defined as the average number of steps a 
Brownian particle takes to move from one vertex to the 
other. Based on this distance measure, in the present 
work we define a dissimilarity index to quantify the 
extent to which two nearest-neighboring vertices differ 
from each other. We observe that vertices belonging to 
the same community usually have very small dissimilarity 
indices between them, while vertices of different 
communities usually have large dissimilarity indices 
between them. This observation leads naturally to an 
algorithm for network clustering. We applied this method 
to several artificial networks and also to different real 
networks in social and biological systems, and 
satisfactory results were obtained. The different clusters 
of a network obtained by our method are each characterized 
by a range of the resolution threshold. 

FIG. 6: The community structure of the reduced protein-protein interaction network of yeast. 



The examples studied in this paper suggest that our 
algorithm is very promising for identifying the community 
structure of a complex networked system. Why does it 
work? We suggest the following reasons. First, the 
vertex-vertex distance measure takes into account the 
topological structure of the network as well as its local 
connections. The distances from one vertex to all the 
other vertices of the network give a perspective of the 
whole network viewed from this vertex. Second, the 
dissimilarity index defined by equation (1) compares the 
perspectives viewed from two nearest-neighboring vertices. 
It is intuitively appealing to assume that the 
perspectives of different vertices of the same community 
are similar to each other, while those of vertices of 
different communities are quite different. 



It is anticipated that the present work will find appli- 
cations in the field of complex networks, as well as in the 
fields of sociological and biological sciences. 